Fundamental Techniques in Data Science with R
Merging Data / Data Table
Linear models
Robust linear modeling
Logistic models
Visualizing marginal effects
Time series in R
Principal components analysis
Clustering
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)
library(DAAG)
## Loading required package: lattice
library(mice)
## 
## Attaching package: 'mice'
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(bench)
## 
## Attaching package: 'bench'
## The following object is masked from 'package:DAAG':
## 
##     press
library(ggplot2) # Plotting suite
library(class)   # K-nearest neighbour
library(mvtnorm) # Multivariate normal tools
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(caret)
Let’s consider the following data example:
df1 <- data.frame(id = c(1:6),
city = c(rep("Utrecht", 3), rep("Rotterdam", 3)))
df2 <- data.frame(ID = c(2, 4, 6),
year = c(1984, 1982, 1990))
head(df1)
##   id      city
## 1  1   Utrecht
## 2  2   Utrecht
## 3  3   Utrecht
## 4  4 Rotterdam
## 5  5 Rotterdam
## 6  6 Rotterdam
head(df2)
##   ID year
## 1  2 1984
## 2  4 1982
## 3  6 1990
merge(x = df1, y = df2, by.x = "id", by.y = "ID")
##   id      city year
## 1  2   Utrecht 1984
## 2  4 Rotterdam 1982
## 3  6 Rotterdam 1990
This is a natural join, or inner join. Only the cases from df1 that have a match in df2 are returned.
merge(x = df1, y = df2, by.x = "id", by.y = "ID", all = TRUE)
##   id      city year
## 1  1   Utrecht   NA
## 2  2   Utrecht 1984
## 3  3   Utrecht   NA
## 4  4 Rotterdam 1982
## 5  5 Rotterdam   NA
## 6  6 Rotterdam 1990
All rows from both data frames are returned: this is a full (outer) join.
merge(y = df1, x = df2, by.y = "id", by.x = "ID", all = TRUE)
##   ID year      city
## 1  1   NA   Utrecht
## 2  2 1984   Utrecht
## 3  3   NA   Utrecht
## 4  4 1982 Rotterdam
## 5  5   NA Rotterdam
## 6  6 1990 Rotterdam
merge(x = df1, y = df2, by.x = "id", by.y = "ID", all.x = TRUE)
##   id      city year
## 1  1   Utrecht   NA
## 2  2   Utrecht 1984
## 3  3   Utrecht   NA
## 4  4 Rotterdam 1982
## 5  5 Rotterdam   NA
## 6  6 Rotterdam 1990
merge(y = df1, x = df2, by.y = "id", by.x = "ID", all.x = TRUE)
##   ID year      city
## 1  2 1984   Utrecht
## 2  4 1982 Rotterdam
## 3  6 1990 Rotterdam
merge(x = df1, y = df2, by.x = "id", by.y = "ID", all.y = TRUE)
##   id      city year
## 1  2   Utrecht 1984
## 2  4 Rotterdam 1982
## 3  6 Rotterdam 1990
The dplyr package ships with two small example tibbles:

band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar
band_members %>% inner_join(band_instruments)
## Joining, by = "name"
## # A tibble: 2 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass
dplyr::inner_join(x, y) returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

band_members %>% left_join(band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass
dplyr::left_join(x, y) returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

band_members %>% right_join(band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar
dplyr::right_join(x, y) returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

band_members %>% full_join(band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar
dplyr::full_join(x, y) returns all rows and all columns from both x and y. Where there is no match, NA is returned for the missing values.
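To see what “all combinations of the matches” means in practice, here is a small sketch with hypothetical data frames `a` and `b` that share a duplicated key:

```r
# Hypothetical example: id 1 occurs twice in both a and b,
# so the join returns 2 x 2 = 4 combinations for that key.
a <- data.frame(id = c(1, 1, 2), x = c("a1", "a2", "b1"))
b <- data.frame(id = c(1, 1), y = c("y1", "y2"))
merge(a, b, by = "id")        # inner join: id 2 has no match and is dropped
nrow(merge(a, b, by = "id"))  # 4
```

The same expansion happens with the dplyr joins; it is something to be aware of whenever keys are not unique.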
dplyr::semi_join(x, y) returns all rows from x where there are matching values in y, keeping just the columns from x:

band_members %>% semi_join(band_instruments, by = "name")
## # A tibble: 2 x 2
##   name  band   
##   <chr> <chr>  
## 1 John  Beatles
## 2 Paul  Beatles
dplyr::anti_join(x, y) returns all rows from x where there are no matching values in y, keeping just the columns from x:

band_members %>% anti_join(band_instruments, by = "name")
## # A tibble: 1 x 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones
dplyr::nest_join(x, y) returns all rows and columns from x, plus a list column of tibbles holding the matching rows from y:

band_members %>% nest_join(band_instruments, by = "name")
## # A tibble: 3 x 3
##   name  band    band_instruments
## * <chr> <chr>   <list>          
## 1 Mick  Stones  <tibble [0 × 1]>
## 2 John  Beatles <tibble [1 × 1]>
## 3 Paul  Beatles <tibble [1 × 1]>
a <- band_members %>% nest_join(band_instruments)
## Joining, by = "name"
a$band_instruments
## [[1]]
## # A tibble: 0 x 1
## # … with 1 variable: plays <chr>
## 
## [[2]]
## # A tibble: 1 x 1
##   plays 
##   <chr> 
## 1 guitar
## 
## [[3]]
## # A tibble: 1 x 1
##   plays
##   <chr>
## 1 bass
Tibbles are a modern take on data frames with a couple of improvements. For example, tibbles can be considerably faster to create:
set.seed(123)
l <- replicate(26, sample(1000), simplify = FALSE)
names(l) <- letters
timing <- bench::mark(
  as_tibble(l),
  as.data.frame(l),
  check = FALSE)
timing
## # A tibble: 2 x 6
##   expression            min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 as_tibble(l)        252µs 446.73µs     2157.   189.9KB     4.29
## 2 as.data.frame(l)    996µs   1.77ms      552.    4.91KB     4.12
Another difference: data frames partially match column names, while tibbles do not.

a <- data.frame(abc = 1)
a$a
## [1] 1
a <- tibble(abc = 1)
a$a
## Warning: Unknown or uninitialised column: 'a'.
## NULL
a$abc
## [1] 1
The mathematical formulation of the relationship between variables can be written as
\[ \mbox{observed}=\mbox{predicted}+\mbox{error} \]
or, in Greek-letter notation, as \[y=\mu+\varepsilon\]
where
There are four key assumptions behind the use of linear regression models: linearity of the relation, independence of the errors, homoscedasticity (constant error variance), and normality of the errors.
In short, we assume
\[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]
The residuals are normally distributed with mean \(\mu_\epsilon = 0\)
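As a sketch of these assumptions (simulated data; the object names here are ours, not part of the lecture), we can generate data from the model itself and check that lm() recovers the coefficients and that the residuals average to zero:

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)  # true alpha = 2, beta = 0.5, normal errors
sim.fit <- lm(y ~ x)
coef(sim.fit)                # estimates close to 2 and 0.5
mean(resid(sim.fit))         # numerically zero, as least squares guarantees
```

With an intercept in the model, the residuals sum to exactly zero by construction; the normality assumption concerns their distribution, not their mean.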
fit <- anscombe %$% lm(y1 ~ x1)
fit
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Coefficients:
## (Intercept)           x1  
##      3.0001       0.5001
fit2 <- anscombe %$% lm(y2 ~ x2)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(2, 2))
plot(fit2)
boys.fit <- na.omit(boys) %$% # na.omit() is extremely wasteful here
  lm(age ~ reg)
boys.fit
## 
## Call:
## lm(formula = age ~ reg)
## 
## Coefficients:
## (Intercept)      regeast      regwest     regsouth      regcity  
##     14.7109      -0.8168      -0.7044      -0.6913      -0.6663
boys %>% na.omit() %$% aggregate(age, list(reg), mean)
##   Group.1        x
## 1   north 14.71094
## 2    east 13.89410
## 3    west 14.00657
## 4   south 14.01965
## 5    city 14.04460
means <- boys %>%
  na.omit() %>%
  group_by(reg) %>%
  summarise(age = mean(age))

ggplot(na.omit(boys), aes(x = reg, y = age)) +
  geom_point(color = "grey") +
  geom_point(data = means, stat = "identity", size = 3)
boys.fit %>% summary()
## 
## Call:
## lm(formula = age ~ reg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8519 -2.5301  0.0254  2.2274  6.2614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.7109     0.5660  25.993   <2e-16 ***
## regeast      -0.8168     0.7150  -1.142    0.255    
## regwest      -0.7044     0.6970  -1.011    0.313    
## regsouth     -0.6913     0.6970  -0.992    0.322    
## regcity      -0.6663     0.9038  -0.737    0.462    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.151 on 218 degrees of freedom
## Multiple R-squared:  0.006745, Adjusted R-squared:  -0.01148 
## F-statistic: 0.3701 on 4 and 218 DF,  p-value: 0.8298
If you have trouble reading scientific notation, 2e-16 means the following
\[2\text{e-16} = 2 \times 10^{-16} = 2 \times \left(\frac{1}{10}\right)^{16}\]
This indicates that the decimal point should be moved 16 places to the left:
\[2\text{e-16} = 0.0000000000000002\]
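The equivalence is easy to verify in R itself:

```r
2e-16 == 2 * 10^-16                 # TRUE: the 'e' is just a power of ten
format(2e-16, scientific = FALSE)   # prints the written-out decimal
```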
boys.fit %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq F value Pr(>F)
## reg         4    14.7  3.6747  0.3701 0.8298
## Residuals 218  2164.6  9.9293
This is not a very informative model: the ANOVA is not significant, indicating that the residual variance dominates the variance explained by the model.
The outcome age does not change significantly when reg is varied.
Akaike’s An Information Criterion
boys.fit %>% AIC()
## [1] 1151.684
AIC comes from information theory and can be used for model selection. The AIC estimates the relative amount of information that is lost when a given model is used to represent the process that generated the data. In other words, AIC quantifies the trade-off between the fit of the model to the data and the complexity of the model.
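For models fitted with lm() or glm(), the AIC is computed as \(2k - 2\log L\), where \(k\) is the number of estimated parameters and \(L\) the likelihood. A quick check on the anscombe example (the object names below are ours):

```r
fit_a <- lm(y1 ~ x1, data = anscombe)
k <- attr(logLik(fit_a), "df")  # parameters: intercept, slope and sigma
manual <- 2 * k - 2 * as.numeric(logLik(fit_a))
all.equal(manual, AIC(fit_a))   # TRUE: matches AIC() exactly
```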
Let’s add predictor hgt to the model:
boys.fit2 <- na.omit(boys) %$%
  lm(age ~ reg + hgt)
boys.fit %>% AIC()
## [1] 1151.684
boys.fit2 %>% AIC()
## [1] 836.3545
Let’s add wgt to the model
boys.fit3 <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt)
Let’s add wgt and the interaction between wgt and hgt to the model
boys.fit4 <- na.omit(boys) %$% lm(age ~ reg + hgt * wgt)
is equivalent to
boys.fit4 <- na.omit(boys) %$% lm(age ~ reg + hgt + wgt + hgt:wgt)
AIC(boys.fit, boys.fit2, boys.fit3, boys.fit4)
##           df       AIC
## boys.fit   6 1151.6839
## boys.fit2  7  836.3545
## boys.fit3  8  822.4164
## boys.fit4  9  823.0386
We can also compare the nested models with anova():

anova(boys.fit, boys.fit2, boys.fit3, boys.fit4)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ reg + hgt * wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 731.8311 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0276 8.595e-05 ***
## 4    215  482.67  1      2.99   1.3324    0.2497    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And inspect the individual models, e.g. boys.fit3:

boys.fit3 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6343    0.1667    
## hgt         1 1642.94 1642.94 730.7065 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0029 8.688e-05 ***
## Residuals 216  485.66    2.25                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And boys.fit4:

boys.fit4 %>% anova()
## Analysis of Variance Table
## 
## Response: age
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## reg         4   14.70    3.67   1.6368    0.1661    
## hgt         1 1642.94 1642.94 731.8311 < 2.2e-16 ***
## wgt         1   35.98   35.98  16.0276 8.595e-05 ***
## hgt:wgt     1    2.99    2.99   1.3324    0.2497    
## Residuals 215  482.67    2.24                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It seems that reg and the interaction hgt:wgt are redundant.
Let's drop reg from the model:

boys.fit5 <- na.omit(boys) %$%
  lm(age ~ hgt + wgt)
Let’s revisit the comparison
anova(boys.fit, boys.fit2, boys.fit3, boys.fit5)
## Analysis of Variance Table
## 
## Model 1: age ~ reg
## Model 2: age ~ reg + hgt
## Model 3: age ~ reg + hgt + wgt
## Model 4: age ~ hgt + wgt
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    218 2164.59                                    
## 2    217  521.64  1   1642.94 730.7065 < 2.2e-16 ***
## 3    216  485.66  1     35.98  16.0029 8.688e-05 ***
## 4    220  492.43 -4     -6.77   0.7529    0.5571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The boys.fit5 model is preferable to the previous model: it fits about equally well, but with fewer parameters.
We start with the full model, which contains all parameters for all columns.
The most straightforward way to go about this is by specifying the following model:
full.model <- lm(age ~ ., data = na.omit(boys))
full.model
## 
## Call:
## lm(formula = age ~ ., data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          wgt          bmi           hc  
##    2.556051     0.059987    -0.009846     0.142162    -0.024086  
##       gen.L        gen.Q        gen.C        gen^4        phb.L  
##    1.287455    -0.006861    -0.187256     0.034186     1.552398  
##       phb.Q        phb.C        phb^4        phb^5           tv  
##    0.499620     0.656277    -0.094722    -0.113686     0.074321  
##     regeast      regwest     regsouth      regcity  
##   -0.222249    -0.233307    -0.258771     0.188423
We can then specify the stepwise model. In this case we choose direction = "both".
step.model <- step(full.model, direction = "both",
trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q  
##     2.07085      0.05482      0.10742      2.70328      0.28877  
##       phb.C        phb^4        phb^5           tv  
##     0.48492     -0.06225     -0.15167      0.07957
Other options are

forward: fit all univariate models, add the best predictor and continue.
backward: fit the full model, eliminate the worst predictor and continue.

step.model %>% summary()
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5077 -0.7144 -0.0814  0.6850  3.1724 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.070854   1.576365   1.314  0.19036    
## hgt          0.054822   0.009986   5.490 1.13e-07 ***
## bmi          0.107415   0.030135   3.564  0.00045 ***
## phb.L        2.703275   0.437108   6.184 3.12e-09 ***
## phb.Q        0.288768   0.211836   1.363  0.17426    
## phb.C        0.484921   0.202856   2.390  0.01769 *  
## phb^4       -0.062246   0.208472  -0.299  0.76555    
## phb^5       -0.151667   0.240599  -0.630  0.52912    
## tv           0.079573   0.019600   4.060 6.89e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.172 on 214 degrees of freedom
## Multiple R-squared:  0.8651, Adjusted R-squared:  0.86 
## F-statistic: 171.5 on 8 and 214 DF,  p-value: < 2.2e-16
full.model <- lm(age ~ ., data = na.omit(boys))
step.model <- MASS::stepAIC(full.model, direction = "both",
trace = FALSE)
step.model
## 
## Call:
## lm(formula = age ~ hgt + bmi + phb + tv, data = na.omit(boys))
## 
## Coefficients:
## (Intercept)          hgt          bmi        phb.L        phb.Q  
##     2.07085      0.05482      0.10742      2.70328      0.28877  
##       phb.C        phb^4        phb^5           tv  
##     0.48492     -0.06225     -0.15167      0.07957
DfBeta calculates, for each observation, the change in each coefficient when that observation is deleted; its standardized counterpart, dfbetas(), expresses this change in standard errors.
step.model %>% dfbeta() %>% head(n = 7)
##       (Intercept)           hgt           bmi        phb.L         phb.Q
## 3279 -0.142127090  0.0010828575 -0.0021654085 -0.021523825 -0.0006087783
## 3283 -0.005201914  0.0001127452 -0.0014100750  0.009068652 -0.0147293625
## 3296 -0.081439010  0.0004244061  0.0002603379 -0.009247562 -0.0071293767
## 3321 -0.084630826  0.0004186923  0.0008222594 -0.005498486 -0.0059276580
## 3323  0.154386486 -0.0004476521 -0.0052964367  0.042872835 -0.0213761347
## 3327 -0.046095468  0.0001081361  0.0011150546 -0.004047875 -0.0085461280
## 3357 -0.045933397  0.0001345605  0.0009770371 -0.005087053 -0.0062604484
##            phb.C        phb^4        phb^5            tv
## 3279 0.007427067 -0.004226982 -0.001213833  0.0001061106
## 3283 0.011761595 -0.005498411  0.001164307  0.0007051901
## 3296 0.008538302 -0.003688611  0.000842872  0.0002494383
## 3321 0.006839217 -0.003341741  0.000866896 -0.0002569914
## 3323 0.011758694 -0.006659755  0.000493897  0.0012317886
## 3327 0.007807966 -0.002901270  0.001499505  0.0003376220
## 3357 0.006152384 -0.002274777  0.001148552  0.0002329346
Let’s use the simpler anscombe data example
fit <- anscombe %$% lm(y1 ~ x1)
y_hat <- fit %>% fitted.values()
The residual is the observed value minus the fitted value; below we compute fitted minus observed, i.e. the residuals with the sign flipped:
y_hat - anscombe$y1
##           1           2           3           4           5           6 
## -0.03900000  0.05081818  1.92127273 -1.30909091  0.17109091  0.04136364 
##           7           8           9          10          11 
## -1.23936364  0.74045455 -1.83881818  1.68072727 -0.17945455
If we introduce new values for the predictor x1, we can generate predicted values from the model
new.x1 <- data.frame(x1 = 1:20)
fit %>% predict(newdata = new.x1)
##         1         2         3         4         5         6         7 
##  3.500182  4.000273  4.500364  5.000455  5.500545  6.000636  6.500727 
##         8         9        10        11        12        13        14 
##  7.000818  7.500909  8.001000  8.501091  9.001182  9.501273 10.001364 
##        15        16        17        18        19        20 
## 10.501455 11.001545 11.501636 12.001727 12.501818 13.001909
pred <- fit %>% predict(newdata = new.x1)
lm(pred ~ new.x1$x1)$coefficients
## (Intercept)   new.x1$x1 
##   3.0000909   0.5000909
fit$coefficients
## (Intercept)          x1 
##   3.0000909   0.5000909
fit %>% predict(interval = "prediction")
##          fit      lwr       upr
## 1   8.001000 5.067072 10.934928
## 2   7.000818 4.066890  9.934747
## 3   9.501273 6.390801 12.611745
## 4   7.500909 4.579129 10.422689
## 5   8.501091 5.531014 11.471168
## 6  10.001364 6.789620 13.213107
## 7   6.000636 2.971271  9.030002
## 8   5.000455 1.788711  8.212198
## 9   9.001182 5.971816 12.030547
## 10  6.500727 3.530650  9.470804
## 11  5.500545 2.390073  8.611017
A prediction interval reflects the uncertainty around a single value. The confidence interval reflects the uncertainty around the mean prediction values.
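We can verify this on the anscombe fit: for the same new observation, the prediction interval is wider than the confidence interval, because it adds the residual variance on top of the uncertainty about the mean (the object names below are ours):

```r
fit_a <- lm(y1 ~ x1, data = anscombe)
new_x <- data.frame(x1 = 10)
conf_int <- predict(fit_a, newdata = new_x, interval = "confidence")
pred_int <- predict(fit_a, newdata = new_x, interval = "prediction")
# both intervals are centred on the same fitted value,
# but the prediction interval is wider
(pred_int[, "upr"] - pred_int[, "lwr"]) > (conf_int[, "upr"] - conf_int[, "lwr"])  # TRUE
```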
In k-fold cross-validation we do this for all \(j\in\{1,\dots,k\}\) training sets: fit the model on the data excluding fold \(j\), then evaluate it on fold \(j\).
A small difference between the in-sample and out-of-sample error suggests good predictive accuracy.
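The procedure can be sketched by hand on the anscombe data with a toy 3-fold split (the fold assignment is random; the names below are ours):

```r
set.seed(123)
k <- 3
folds <- sample(rep(1:k, length.out = nrow(anscombe)))
cv_mse <- sapply(1:k, function(j) {
  train <- anscombe[folds != j, ]  # fit on everything except fold j
  test  <- anscombe[folds == j, ]  # evaluate on the held-out fold j
  m <- lm(y1 ~ x1, data = train)
  mean((test$y1 - predict(m, newdata = test))^2)
})
mean(cv_mse)  # cross-validated mean squared error
```

This is essentially what DAAG::CVlm() automates below.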
fit %>% summary()
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92127 -0.45577 -0.04136  0.70941  1.83882 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   3.0001     1.1247   2.667  0.02573 * 
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295 
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
Cross-validation on the anscombe data:

DAAG::CVlm(anscombe, fit, plotit = FALSE, printit = TRUE)
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value Pr(>F)   
## x1         1   27.5   27.51      18 0.0022 **
## Residuals  9   13.8    1.53                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## fold 1 
## Observations in test set: 3 
##                  5    7    11
## x1          11.000 6.00 5.000
## cvpred       8.443 5.59 5.014
## y1           8.330 7.24 5.680
## CV residual -0.113 1.65 0.666
## 
## Sum of squares = 3.19    Mean square = 1.06    n = 3 
## 
## fold 2 
## Observations in test set: 4 
##                  2      6     8    10
## x1           8.000 14.000  4.00  7.00
## cvpred       7.572  9.681  6.17  7.22
## y1           6.950  9.960  4.26  4.82
## CV residual -0.622  0.279 -1.91 -2.40
## 
## Sum of squares = 9.86    Mean square = 2.47    n = 4 
## 
## fold 3 
## Observations in test set: 4 
##                 1     3    4     9
## x1          10.00 13.00 9.00 12.00
## cvpred       7.84  9.37 7.33  8.86
## y1           8.04  7.58 8.81 10.84
## CV residual  0.20 -1.79 1.48  1.98
## 
## Sum of squares = 9.35    Mean square = 2.34    n = 4 
## 
## Overall (Sum over all 4 folds) 
##   ms 
## 2.04
Cross-validation plots for the anscombe data:

DAAG::CVlm(anscombe, fit, plotit = "Observed", printit = FALSE)
DAAG::CVlm(anscombe, fit, plotit = "Residual", printit = FALSE)
And for the boys data:

DAAG::CVlm(na.omit(boys), step.model, plotit = "Observed", printit = FALSE)
DAAG::CVlm(na.omit(boys), step.model, plotit = "Residual", printit = FALSE)
na.omit(boys) %$% lm(age ~ reg + hgt * wgt) %>% nobs()
## [1] 223
Had we not used na.omit():
boys %$% lm(age ~ reg + hgt * wgt) %>% nobs()
## [1] 724
We start this lecture with a data set that logs the survival of passengers aboard the ocean liner Titanic on its disastrous maiden voyage.
titanic <- read.csv(file = "titanic.csv", header = TRUE)
titanic %>% describe()
##                         vars   n   mean     sd median trimmed   mad  min
## Survived                   1 887   0.39   0.49    0.0    0.36   0.0 0.00
## Pclass                     2 887   2.31   0.84    3.0    2.38   0.0 1.00
## Name*                      3 887 444.00 256.20  444.0  444.00 329.1 1.00
## Sex*                       4 887   1.65   0.48    2.0    1.68   0.0 1.00
## Age                        5 887  29.47  14.12   28.0   28.96  11.9 0.42
## Siblings.Spouses.Aboard    6 887   0.53   1.10    0.0    0.27   0.0 0.00
## Parents.Children.Aboard    7 887   0.38   0.81    0.0    0.19   0.0 0.00
## Fare                       8 887  32.31  49.78   14.4   21.50  10.3 0.00
##                         max range  skew kurtosis   se
## Survived                  1   1.0  0.47    -1.78 0.02
## Pclass                    3   2.0 -0.62    -1.29 0.03
## Name*                   887 886.0  0.00    -1.20 8.60
## Sex*                      2   1.0 -0.61    -1.63 0.02
## Age                      80  79.6  0.45     0.28 0.47
## Siblings.Spouses.Aboard   8   8.0  3.67    17.64 0.04
## Parents.Children.Aboard   6   6.0  2.73     9.63 0.03
## Fare                    512 512.3  4.76    32.99 1.67
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
We have information on the following features.
Our outcome/dependent variable:
Some potential predictors:
Sex: the sex of the passenger, coded as c(male, female); and more.
We can start investigating if there are patterns in this data that are related to the survival probability.
For example, based on the adage “women and children first”, we could hypothesize that

Age relates to the probability of survival in that younger passengers have a higher probability of survival
Sex relates to survival in that females have a higher probability of survival

Based on socio-economic status, we could hypothesize that

Pclass relates to the probability of survival in that a higher travel class leads to a higher probability of survival

And so on.
Is Age related to survival?

titanic %>%
  ggplot(aes(x = Age, y = Survived)) +
  geom_point() +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"),
              se = FALSE) +
  xlim(-1, 100)
titanic %$% table(Pclass, Survived)
##       Survived
## Pclass   0   1
##      1  80 136
##      2  97  87
##      3 368 119
It seems that the higher the class (i.e. 1 > 2 > 3), the higher the probability of survival.
We can verify this
titanic %$% table(Pclass, Survived) %>% prop.table(margin = 1) %>% round(digits = 2)
##       Survived
## Pclass    0    1
##      1 0.37 0.63
##      2 0.53 0.47
##      3 0.76 0.24
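To see what margin = 1 does, we can rebuild the same table from its counts: with margin = 1 every row (each class) sums to 1, so the entries are survival proportions per class.

```r
# counts taken from the table above
tab <- matrix(c(80, 97, 368, 136, 87, 119), nrow = 3,
              dimnames = list(Pclass = 1:3, Survived = 0:1))
round(prop.table(tab, margin = 1), 2)
rowSums(prop.table(tab, margin = 1))  # each row sums to exactly 1
```

With margin = 2 the columns would sum to 1 instead, giving the class composition within survivors and non-survivors.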
Let’s fit these models and gradually build them up in the number of predictors.
fit1 <- titanic %$% glm(Survived ~ Age,    family = binomial(link = "logit"))
fit2 <- titanic %$% glm(Survived ~ Sex,    family = binomial(link = "logit"))
fit3 <- titanic %$% glm(Survived ~ Pclass, family = binomial(link = "logit"))
Survived ~ Age

titanic %$% histogram(~ Age | Survived == 1)
The distribution of Age for the survivors (TRUE) is different from the distribution of Age for the non-survivors (FALSE). Especially at the younger end there is a point mass for the survivors, which indicates that children have a higher probability of survival. However, it is not dramatically different.
Survived ~ Age

fit1 %>% summary() %>% .$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.20919    0.15949   -1.31   0.1897
## Age         -0.00877    0.00495   -1.77   0.0761
We can see that there is a trend: the log odds of Survived decrease with increasing Age. For every year increase in age, the log odds of survival decrease by 0.009. However, this decrease is not very convincing given the effect size, the standard error, and the size of the data. Hence, the p-value is a little on the larger side.
When we inspect the deviance
c(fit1$null.deviance, fit1$deviance)
## [1] 1183 1180
We see that there is almost no decrease in deviance between the empty model (i.e. the model with only an intercept) and the model with Age included. The difference in df is only 1 parameter.
Survived ~ Sex

titanic %$% histogram(~ Sex | Survived == 1)
Wow! These distributions are very different! Females seem to have a much higher probability of survival.
Survived ~ Sex

fit2 %>% summary() %>% .$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)     1.06      0.129    8.19 2.58e-16
## Sexmale        -2.51      0.167  -14.98 9.95e-51
The coefficients confirm this. The log odds of Survived decrease for males, when compared to females. The decrease is quite pronounced, too: -2.505.
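Exponentiating a log-odds coefficient turns it into an odds ratio, which is often easier to communicate:

```r
exp(-2.505)  # approx 0.08: the odds of survival for males are
             # roughly 8% of the odds for females
```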
When we inspect the deviance
c(fit2$null.deviance, fit2$deviance)
## [1] 1183 916
We see that there is a massive gain in the deviance between the empty model (i.e. the model with only an intercept) and the model with Sex included. The difference in df is only 1 parameter.
Survived ~ Pclass

titanic %$% histogram(~ Pclass | Survived == 1)
There is a very apparent difference between the distributions of the survivors and non-survivors over the classes. For example, we see that in 1st and 2nd class there are more survivors than non-survivors, while in the third class this relation is opposite.
Survived ~ Pclass

fit3 %>% summary() %>% .$coefficients
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)    1.439     0.2074    6.94 3.98e-12
## Pclass        -0.844     0.0872   -9.68 3.57e-22
Here, something is wrong! Class is an ordered categorical variable, not a continuous variable. In other words, we cannot simply change the log odds of Survived by -0.844 for every unit increase in Pclass.
If we interpreted these coefficients ‘as is’, we would treat the class codes as if they were measurements: class 2 as twice class 1, class 3 as 1.5 times class 2, and so on.
titanic %<>%
  mutate(Pclass = factor(Pclass, levels = c(3, 2, 1), ordered = FALSE))
str(titanic)
## 'data.frame':    887 obs. of  8 variables:
##  $ Survived               : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass                 : Factor w/ 3 levels "3","2","1": 1 3 1 3 1 1 3 1 1 2 ...
##  $ Name                   : Factor w/ 887 levels "Capt. Edward Gifford Crosby",..: 602 823 172 814 733 464 700 33 842 839 ...
##  $ Sex                    : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age                    : num  22 38 26 35 35 27 54 2 27 14 ...
##  $ Siblings.Spouses.Aboard: int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parents.Children.Aboard: int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Fare                   : num  7.25 71.28 7.92 53.1 8.05 ...
The Pclass column is now correctly coded as a factor. We ignore the ordering as this goes beyond the scope of the course.
titanic %$%
  glm(Survived ~ Age + Sex + Pclass, family = binomial(link = "logit")) %>%
  summary()

## 
## Call:
## glm(formula = Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.681  -0.665  -0.414   0.637   2.450  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.17948    0.22784    5.18  2.3e-07 ***
## Age         -0.03427    0.00716   -4.79  1.7e-06 ***
## Sexmale     -2.58872    0.18701  -13.84  < 2e-16 ***
## Pclass2      1.25633    0.22646    5.55  2.9e-08 ***
## Pclass1      2.45544    0.25322    9.70  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1182.77  on 886  degrees of freedom
## Residual deviance:  801.59  on 882  degrees of freedom
## AIC: 811.6
## 
## Number of Fisher Scoring iterations: 5
fit.interaction <- titanic %$% glm(Survived ~ Age * Sex * Pclass,
family = binomial(link="logit"))
fit.interaction %>% summary %>% .$coefficients
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)           0.3854     0.3516   1.096   0.2730
## Age                  -0.0174     0.0140  -1.245   0.2132
## Sexmale              -1.2410     0.5197  -2.388   0.0169
## Pclass2               3.6638     1.3814   2.652   0.0080
## Pclass1               1.1122     1.4959   0.744   0.4572
## Age:Sexmale          -0.0226     0.0207  -1.094   0.2740
## Age:Pclass2          -0.0320     0.0385  -0.831   0.4059
## Age:Pclass1           0.0804     0.0528   1.521   0.1282
## Sexmale:Pclass2      -1.9112     1.5759  -1.213   0.2252
## Sexmale:Pclass1       0.8149     1.6602   0.491   0.6236
## Age:Sexmale:Pclass2  -0.0313     0.0494  -0.633   0.5265
## Age:Sexmale:Pclass1  -0.0800     0.0569  -1.407   0.1595
Now none of the main effects is significant: the explained variance has been absorbed by the interaction effects.
An interaction occurs when the (causal) effect of one predictor on the outcome depends on the level of another predictor.
E.g. the relation between body temperature and air temperature depends on the species.
To illustrate, I will limit this investigation to Age and Pclass for males only.
We use the predict function to illustrate the conditional probabilities within each class. To do so, we need to create a new data frame that contains all the combinations of predictors we need.
new <- data.frame(Pclass = factor(rep(c(1, 2, 3), c(80, 80, 80))),
Age = rep(1:80, times = 3),
Sex = rep("male", times = 240))
new <- cbind(new,
predict(fit.interaction, newdata = new,
type = "link", se = TRUE))
head(new)
##   Pclass Age  Sex   fit se.fit residual.scale
## 1      1   1 male 1.032  0.596              1
## 2      1   2 male 0.992  0.583              1
## 3      1   3 male 0.952  0.569              1
## 4      1   4 male 0.913  0.555              1
## 5      1   5 male 0.873  0.542              1
## 6      1   6 male 0.833  0.528              1
There are two simple approaches to obtain the predicted probabilities. First, we could simply ask for the predicted response:
new$prob <- predict(fit.interaction, newdata = new, type = "response")
Or we could use the distribution function plogis():
new$prob2 <- plogis(new$fit)
head(new)
##   Pclass Age  Sex   fit se.fit residual.scale  prob prob2
## 1      1   1 male 1.032  0.596              1 0.737 0.737
## 2      1   2 male 0.992  0.583              1 0.729 0.729
## 3      1   3 male 0.952  0.569              1 0.722 0.722
## 4      1   4 male 0.913  0.555              1 0.714 0.714
## 5      1   5 male 0.873  0.542              1 0.705 0.705
## 6      1   6 male 0.833  0.528              1 0.697 0.697
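The two routes agree because plogis() is the logistic distribution function: it maps log odds (the link scale) to probabilities, and qlogis() is its inverse (the logit):

```r
plogis(0)                                # 0.5: a log odds of 0 is a fifty-fifty probability
all.equal(qlogis(plogis(1.032)), 1.032)  # TRUE: qlogis() undoes plogis()
```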
new %<>%
mutate(lower = plogis(fit - 1.96 * se.fit),
upper = plogis(fit + 1.96 * se.fit))
head(new)
##   Pclass Age  Sex   fit se.fit residual.scale  prob prob2 lower upper
## 1      1   1 male 1.032  0.596              1 0.737 0.737 0.466 0.900
## 2      1   2 male 0.992  0.583              1 0.729 0.729 0.463 0.894
## 3      1   3 male 0.952  0.569              1 0.722 0.722 0.459 0.888
## 4      1   4 male 0.913  0.555              1 0.714 0.714 0.456 0.881
## 5      1   5 male 0.873  0.542              1 0.705 0.705 0.453 0.874
## 6      1   6 male 0.833  0.528              1 0.697 0.697 0.450 0.866
A data frame with simulated Pclass and Age for males.
new %>% summary()
##  Pclass      Age           Sex           fit             se.fit     
##  1:80   Min.   : 1.0   male:240   Min.   :-7.37   Min.   :0.159  
##  2:80   1st Qu.:20.8              1st Qu.:-3.27   1st Qu.:0.288  
##  3:80   Median :40.5              Median :-1.79   Median :0.416  
##         Mean   :40.5              Mean   :-2.10   Mean   :0.513  
##         3rd Qu.:60.2              3rd Qu.:-0.74   3rd Qu.:0.609  
##         Max.   :80.0              Max.   : 1.03   Max.   :1.607  
##  residual.scale      prob            prob2           lower      
##  Min.   :1      Min.   :0.001   Min.   :0.001   Min.   :0.000  
##  1st Qu.:1      1st Qu.:0.037   1st Qu.:0.037   1st Qu.:0.012  
##  Median :1      Median :0.143   Median :0.143   Median :0.083  
##  Mean   :1      Mean   :0.213   Mean   :0.213   Mean   :0.136  
##  3rd Qu.:1      3rd Qu.:0.322   3rd Qu.:0.322   3rd Qu.:0.222  
##  Max.   :1      Max.   :0.737   Max.   :0.737   Max.   :0.466  
##      upper      
##  Min.   :0.015  
##  1st Qu.:0.108  
##  Median :0.250  
##  Mean   :0.312  
##  3rd Qu.:0.441  
##  Max.   :0.900
new %>%
  ggplot(aes(x = Age, y = fit)) +
  geom_line(aes(colour = Pclass), lwd = 1)
new %>%
ggplot(aes(x = Age, y = prob)) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) +
geom_line(aes(colour = Pclass), lwd = 1) + ylab("Probability of Survival")
To repeat the previous scenario for the model where only main effects are included:
fit <- titanic %$% glm(Survived ~ Age + Sex + Pclass, family = binomial(link="logit"))
new2 <- new %>% select(Age, Pclass, Sex)
new2 <- cbind(new2, predict(fit, newdata = new2, type = "link", se = TRUE))
new2 %<>%
mutate(prob = plogis(fit),
lower = plogis(fit - 1.96 * se.fit),
upper = plogis(fit + 1.96 * se.fit))
new2 %>%
  ggplot(aes(x = Age, y = fit)) +
  geom_line(aes(colour = Pclass), lwd = 1)
new2 %>%
ggplot(aes(x = Age, y = prob)) +
geom_ribbon(aes(ymin = lower, ymax = upper, fill = Pclass), alpha = .2) +
geom_line(aes(colour = Pclass), lwd = 1) + ylab("Probability of Survival")
anova(fit, fit.interaction, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Survived ~ Age + Sex + Pclass
## Model 2: Survived ~ Age * Sex * Pclass
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)    
## 1       882        802                         
## 2       875        755  7     46.4  7.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction model fits the data much better.
AIC(fit, fit.interaction)
##                 df AIC
## fit              5 812
## fit.interaction 12 779
pred <- predict(fit, type = "response")
head(pred)
##      1      2      3      4      5      6 
## 0.1031 0.9115 0.5716 0.9195 0.0686 0.0883
pred <- ifelse(pred > 0.5, yes = 1, no = 0)
head(pred)
## 1 2 3 4 5 6 
## 0 1 1 1 0 0
CM <- table(pred, titanic$Survived)
CM
##     
## pred   0   1
##    0 463 100
##    1  82 242
The accuracy can then be calculated as the percentage of correct predictions, i.e. the sum over the diagonal elements divided by the total number of cases:
correct <- CM %>% diag() %>% sum()
correct / sum(CM)
## [1] 0.795
confusionMatrix(as.factor(pred), reference = as.factor(titanic$Survived))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 463 100
##          1  82 242
## 
##                Accuracy : 0.795         
##                  95% CI : (0.767, 0.821)
##     No Information Rate : 0.614         
##     P-Value [Acc > NIR] : <2e-16        
## 
##                   Kappa : 0.563         
## 
##  Mcnemar's Test P-Value : 0.208         
## 
##             Sensitivity : 0.850         
##             Specificity : 0.708         
##          Pos Pred Value : 0.822         
##          Neg Pred Value : 0.747         
##              Prevalence : 0.614         
##          Detection Rate : 0.522         
##    Detection Prevalence : 0.635         
##       Balanced Accuracy : 0.779         
## 
##        'Positive' Class : 0
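Two of these statistics can be reproduced by hand from the table; note that caret took 0 (did not survive) as the 'positive' class here:

```r
# cells of the confusion matrix above, with 0 as the positive class
TP <- 463; FN <- 82; FP <- 100; TN <- 242
TP / (TP + FN)  # sensitivity: 0.850
TN / (TN + FP)  # specificity: 0.708
```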
##                        Survived not.Survived
## Predicted Survived            A            B
## Predicted not.Survived        C            D

A: cases of Survived that have been correctly predicted as Survived.
D: cases of not Survived that have been correctly predicted as not Survived.

fit <- glm(Survived ~ Age + Sex + Pclass, family = binomial(link = "logit"),
           data = titanic)
set.seed(123)
cv <- CVbinary(fit)
## 
## Fold:  3 10  2  6  5  4  9  8  7  1 
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.793
The accuracy after cross-validation is more or less the same as the accuracy in the data as a whole. This indicates that the fitted model may be useful for new data from the same population; i.e. the model may generalize to new data.
The `CVbinary()` function relies on K-fold cross-validation, which randomly splits the data into K folds (parts). To allow you to replicate these results I fix the random seed.
set.seed(123)
CVbinary(fit, nfolds = 5)
## 
## Fold:  3 2 5 4 1 
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.797
CVbinary(fit, nfolds = 2)
## 
## Fold:  1 2 
## Internal estimate of accuracy = 0.795
## Cross-validation estimate of accuracy = 0.782
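The idea behind K-fold cross-validation can be sketched in a few lines of base R. The sketch below uses the built-in `mtcars` data (with the binary outcome `am`) instead of the titanic data, so the model formula is illustrative only:

```r
set.seed(123)
K <- 5
# Randomly assign every row to one of K folds
folds <- sample(rep(1:K, length.out = nrow(mtcars)))

acc <- numeric(K)
for (k in 1:K) {
  # Train on all folds except fold k, then predict fold k
  fit.k  <- glm(am ~ wt + hp, family = binomial, data = mtcars[folds != k, ])
  p      <- predict(fit.k, newdata = mtcars[folds == k, ], type = "response")
  acc[k] <- mean(ifelse(p > 0.5, 1, 0) == mtcars$am[folds == k])
}
mean(acc)  # cross-validation estimate of accuracy
```

Each observation is predicted exactly once by a model that never saw it during fitting, which is what makes the resulting accuracy estimate honest.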
helpIAmColourblind <- scale_color_manual(values = c("orange",
"blue",
"dark green"))
Several questions involving statistics:
What can I conclude about my population? (inference/hypothesis testing)
How can I show relevant patterns in my data? (dimension reduction / pattern recognition)
In supervised learning we aim to quantify the relation between \(Y\) and \(X\).
\(Y\):
\(X\):
We want to find the predictive function:
\[Y = f(X) + \epsilon \]
That minimizes \(\epsilon\) with respect to our goal.
Our aim is to find the \(f(X)\) that best represents the systematic information that \(X\) yields about \(Y\).
With supervised learning every observation on our predictor
\[x_i, i=1, \dots, n\]
has a corresponding outcome measurement
\[y_i\] such that
\[\hat{y_i}=f({\bf x_i})\quad \text{and} \quad y_i = f({\bf x_i})+\epsilon_i.\]
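A tiny simulation illustrates this decomposition, assuming for the sketch that \(f\) is linear: the fitted values estimate \(f(x_i)\) and the residuals estimate \(\epsilon_i\).

```r
set.seed(123)
n <- 100
x <- runif(n)
f <- function(x) 2 + 3 * x       # the (normally unknown) systematic part f(X)
y <- f(x) + rnorm(n, sd = 0.5)   # add irreducible error epsilon

fit   <- lm(y ~ x)               # estimate f with a linear model
y.hat <- fitted(fit)             # estimates f(x_i)
e.hat <- y - y.hat               # estimates epsilon_i
round(coef(fit), 2)              # should be close to the true values 2 and 3
```

No matter how well we estimate \(f\), the \(\epsilon_i\) part remains: it is the irreducible error.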
Examples:
With unsupervised learning we have a vector of measurements \(\bf x_i\) for every unit \(i=1, \dots, n\), but no associated response \(y_i\).
There is no outcome to predict
There is no outcome to verify the model
Find patterns in \(\bf x_1, \dots, x_n\)
We can use this model to e.g. find out if some cases are more similar than other cases or which variables explain most of the variation
Examples:
Let’s create some data from a multivariate normal distribution
We start with fixing the random seed
set.seed(123)
and specifying the variance covariance matrix:
sigma <- matrix(c(1, .5, .5, 1), 2, 2)
rownames(sigma) <- colnames(sigma) <- c("x1", "x2")
sigma
##     x1  x2
## x1 1.0 0.5
## x2 0.5 1.0
Because the variances are 1, the resulting data will have a correlation of \[\rho = \frac{\text{cov}(x_1, x_2)}{\sigma_{x_1}\sigma_{x_2}} = \frac{.5}{1\times1} = .5.\]
Let’s draw the data
sim.data <- mvtnorm::rmvnorm(n = 100,
mean = c(5, 5),
sigma = sigma)
colnames(sim.data) <- c("x1", "x2")
sim.data %>% as_tibble %>% ggplot(aes(x1, x2)) + geom_point()
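As a sanity check we can verify that the simulated data indeed have a correlation near .5. The check below repeats the exact same seeded draw, so it does not change the random numbers used in the remainder of this example:

```r
library(mvtnorm)

# Re-create the draw from above and inspect the empirical correlation
set.seed(123)
check <- rmvnorm(n = 100,
                 mean  = c(5, 5),
                 sigma = matrix(c(1, .5, .5, 1), 2, 2))
cor(check[, 1], check[, 2])  # near the population value .5 (not exact: n = 100)
```

With only 100 observations the sample correlation will deviate somewhat from the population value; it converges to .5 as n grows.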
sim.data <-
sim.data %>%
as_tibble %>%
mutate(class = sample(c("A", "B", "C"), size = 100, replace = TRUE))
We have added a new column that randomly assigns rows to level A, B or C
sim.data %>% head
## # A tibble: 6 x 3
##      x1    x2 class
##   <dbl> <dbl> <chr>
## 1  4.40  4.63 C
## 2  6.52  5.47 C
## 3  5.57  6.69 C
## 4  5.12  3.90 A
## 5  4.22  4.39 A
## 6  6.28  5.66 C
sim.data %>% ggplot(aes(x1, x2, colour = class)) + geom_point() + helpIAmColourblind
sim.data <-
sim.data %>%
mutate(x2 = case_when(class == "A" ~ x2 + 1.5,
class == "B" ~ x2 - 1.5,
class == "C" ~ x2 + 1.5),
x1 = case_when(class == "A" ~ x1 - 1.5,
class == "B" ~ x1 - 0,
class == "C" ~ x1 + 1.5))
sim.data %>% ggplot(aes(x1, x2, colour = class)) + geom_point() + helpIAmColourblind
sim.data %>% ggplot(aes(x1, x2)) + geom_point()
We estimate the probability for \(x_0\) being part of class \(j\) as the fraction of points in \(\mathcal{N}_0\) for whom the response equals \(j\): \[P(Y = j | X = x_0) = \frac{1}{K}\sum_{i\in\mathcal{N}_0}I(y_i=j)\]
The observation \(x_0\) is classified to the class with the largest probability
An observation is assigned the class to which most of its \(K\) neighbours belong
Because an observation is assigned to the class to which most of its neighbours belong, K-NN is
non-parametric
expected to be far better than logistic regression when decision boundaries are non-linear
However, we do not get parameters as with LDA and regression.
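The probability formula above is easy to compute by hand. A minimal sketch with a hypothetical neighbourhood of \(K = 5\) points:

```r
# Hypothetical classes of the K = 5 nearest neighbours of x0
K <- 5
neighbour.classes <- c("A", "A", "B", "A", "C")

# P(Y = j | X = x0) = (1/K) * sum over the neighbourhood of I(y_i = j)
probs <- sapply(c("A", "B", "C"), function(j) mean(neighbour.classes == j))
probs
##   A   B   C 
## 0.6 0.2 0.2
predicted <- names(which.max(probs))  # x0 gets the class with the largest probability
predicted
## [1] "A"
```

This is exactly what `knn()` does internally for every test observation, after first finding the K nearest training points in predictor space.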
First we need to determine a training set
set.seed(123)
sim.data <-
sim.data %>%
mutate(set = sample(c("Train", "Test"), size = 100,
prob = c(.25, .75), replace = TRUE))
sim.data
## # A tibble: 100 x 4
##       x1    x2 class set  
##    <dbl> <dbl> <chr> <chr>
##  1  5.90  6.13 C     Test 
##  2  8.02  6.97 C     Train
##  3  7.07  8.19 C     Test 
##  4  3.62  5.40 A     Train
##  5  2.72  5.89 A     Train
##  6  7.78  7.16 C     Test 
##  7  5.42  3.71 B     Test 
##  8  6.43  8.08 C     Train
##  9  4.97  1.73 B     Test 
## 10  4.06  6.22 A     Test 
## # … with 90 more rows
Then we split the data into a training (build the model) and a test (verify the model) set
train.data <- subset(sim.data, set == "Train", select = c(x1, x2))
test.data  <- subset(sim.data, set == "Test",  select = c(x1, x2))
obs.class  <- subset(sim.data, set == "Train", select = class)
Now we can fit the K-NN model
fit.knn <- knn(train = train.data,
test = test.data,
cl = as.matrix(obs.class),
k = 3)
fit.knn
##  [1] C C C B B A B B A A C C A A C A A C C C B C C B B A B B B B A B A B A
## [36] C A C C B C C C A A C B C B A A B B C C A B B C C C B A B B C B A C A
## [71] C B A
## Levels: A B C
class.test <- subset(sim.data, set == "Test", select = class) %>% as.matrix()
correct <- fit.knn == class.test
mean(correct)
## [1] 0.959
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  0  0
##       B  0 24  1
##       C  0  2 25
cbind(test.data, correct) %>%
ggplot(aes(x1, x2, colour = correct)) +
geom_point() +
scale_colour_manual(values = c("red", "black"))
fit.knn <- knn(train = train.data,
test = test.data,
cl = as.matrix(obs.class),
k = 2)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.945
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  1  1
##       B  0 23  0
##       C  0  2 25
fit.knn <- knn(train = train.data,
test = test.data,
cl = as.matrix(obs.class),
k = 4)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.945
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  0  1
##       B  0 24  1
##       C  0  2 24
fit.knn <- knn(train = train.data,
test = test.data,
cl = as.matrix(obs.class),
k = 10)
correct <- fit.knn == class.test
mean(correct)
## [1] 0.89
table(fit.knn, class.test)
##        class.test
## fit.knn  A  B  C
##       A 21  1  5
##       B  0 25  2
##       C  0  0 19
cbind(test.data, correct) %>%
ggplot(aes(x1, x2, colour = correct)) +
geom_point() +
scale_colour_manual(values = c("red", "black"))
Let’s make a new observation:
newObs <- data.frame(x1 = 5.5, x2 = 4.5)
Now we predict the class of this new observation, using the entire data for training our model
knn(train = sim.data[, 1:2], cl = sim.data$class, k = 10, test = newObs)
## [1] B
## Levels: A B C
sim.data %>% ggplot(aes(x1, x2)) + geom_point()
K-means clustering partitions our dataset into \(K\) distinct, non-overlapping clusters or subgroups.
A set of relatively similar observations.
This is up to the programmer/researcher to decide. For example, we can require that the within-class variance is as small as possible and the between-class variance is as large as possible.
We expect clusters in our data, but weren’t able to measure them
We want to summarise features into a categorical variable to use in further decisions/analysis
Compute the cluster centroid (e.g. with colMeans) for each class.
Source: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). New York: Springer.
K is a tuning parameter (centers)
(fitkm <- kmeans(sim.data[, 1:2], centers = 3))
## K-means clustering with 3 clusters of sizes 39, 35, 26
## 
## Cluster means:
##     x1   x2
## 1 4.99 3.53
## 2 3.59 6.39
## 3 6.85 6.98
## 
## Clustering vector:
##   [1] 3 3 3 2 2 3 1 3 1 2 1 1 1 2 2 2 3 3 2 2 3 2 3 1 2 2 3 1 2 1 2 1 3 1 3
##  [36] 1 2 1 2 1 1 1 1 2 1 2 1 2 3 1 2 3 1 1 1 3 2 2 1 1 2 2 3 1 3 1 1 1 2 1
##  [71] 2 2 2 1 1 3 3 2 1 1 3 3 3 3 1 2 2 1 2 1 1 3 1 2 3 2 2 3 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 49.3 46.5 49.2
##  (between_SS / total_SS =  73.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
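Because K (`centers`) is a tuning parameter, a common heuristic (not part of the material above) is the "elbow" plot: fit k-means for a range of K and inspect the total within-cluster sum of squares, which keeps decreasing as K grows but flattens once the true number of clusters is reached. A standalone sketch on simulated data with three clear clusters:

```r
set.seed(123)
# Toy data: three well-separated bivariate clusters of 50 points each
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))

# Total within-cluster sum of squares for K = 1, ..., 6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
plot(1:6, wss, type = "b",
     xlab = "K (centers)", ylab = "Total within-cluster SS")
```

For these data the curve drops steeply up to K = 3 and flattens afterwards, suggesting three clusters; `nstart = 10` reruns the algorithm from multiple random starts to avoid poor local optima.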
sim.data$clust <- as.factor(fitkm$cluster)
sim.data %>% ggplot + geom_point(aes(x = x1, y = x2, colour = clust)) + helpIAmColourblind
# this is the data-generating class
sim.data %>% ggplot + geom_point(aes(x = x1, y = x2, colour = class)) + helpIAmColourblind
sim.data %>% ggplot +
geom_point(aes(x = x1, y = x2, colour = clust)) +
geom_point(aes(x = x1, y = x2), data = as.data.frame(fitkm$centers),
size = 5, col = "red", alpha = 0.8) +
helpIAmColourblind